Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files #1033
Conversation
Force-pushed from 4231f1d to b0d1490.
Thank you very much, @santoshborse. I tested the transform by running
One problem with this transform is that it has no tests:
$ make test-src
...
source venv/bin/activate; \
export PYTHONPATH=../src:../: ; \
cd test; pytest -s .
/bin/bash: line 3: cd: test: No such file or directory
============================================================================================================ test session starts ============================================================================================================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/cma/de/data-prep-kit/transforms
configfile: pyproject.toml
plugins: cov-6.0.0, anyio-4.8.0
collected 0 items
=========================================================================================================== no tests ran in 0.01s ===========================================================================================================
make: *** [../../../.make.defaults:442: .defaults.test-src] Error 5
Please add some tests to the transform.
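For illustration only (the test-data layout and runtime parameters here are assumptions, not taken from this PR), a smoke test under a new test/ folder could look roughly like this:

import os

from dpk_tokenization2arrow.transform_python import Tokenization2Arrow


def test_tokenization2arrow_smoke(tmp_path):
    # Hypothetical layout: a small parquet sample checked into test-data/input.
    input_folder = os.path.join(os.path.dirname(__file__), "..", "test-data", "input")
    output_folder = str(tmp_path)
    # Parameter names follow the usual dpk runtime style; add whatever
    # tokenizer options this transform actually requires.
    Tokenization2Arrow(input_folder=input_folder, output_folder=output_folder).transform()
    produced = os.listdir(output_folder)
    # The goal of the PR is .arrow output plus a metadata file.
    assert any(name.endswith(".arrow") for name in produced)
    assert "metadata.json" in produced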
Review threads (outdated, resolved) on:
transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform.py
transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform_python.py
transforms/universal/tokenization2arrow/dpk_tokenization2arrow/transform_ray.py
Can you use the content from here? https://github.com/IBM/data-prep-kit/blob/tokenization2arrow_transform/transforms/Dockerfile.ray.template
For consistency, it may be better to use ray.transform. This will make it easier to maintain the module.
I think because we have transform_python, which we use as
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow
it makes more sense to have transform_ray, but if you insist I will change it.
@santoshborse The latest convention is to use dpk_<transform>.runtime (instead of dpk_<transform>.transform_python) and dpk_<transform>.ray.runtime. It would be nice if you could stay with the convention. It makes support easier, and there does not seem to be a good reason not to.
# TODO: check if we should add anything to tokenization_metadata
return [(bos.getvalue().to_pybytes(), ".arrow")], tokenization_metadata

def transform_binary(self, file_name: str, byte_array: bytes) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
Not clear why this is implementing transform_binary() and not transform(). I think the code structure will be easier to understand/maintain if you implement transform() and then call super().transform() before calling transforms_to_arrow().
transform_binary returns tuple[list[tuple[bytes, str]], dict[str, Any]], while transform returns tuple[list[pa.Table], dict[str, Any]]. I am using transform_binary so that I can return the data as bytes (so that the framework can write the .arrow files).
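As a sketch of that bytes path (variable names mirror the snippet above; whether the IPC file or stream writer is used here is an assumption, not confirmed by this PR):

import pyarrow as pa


def table_to_arrow_bytes(table: pa.Table) -> bytes:
    # Serialize a tokenized pyarrow Table to Arrow IPC bytes so the framework
    # can write the buffer out as a .arrow file.
    bos = pa.BufferOutputStream()
    with pa.ipc.new_file(bos, table.schema) as writer:
        writer.write_table(table)
    return bos.getvalue().to_pybytes()

transform_binary() can then return something like [(table_to_arrow_bytes(t), ".arrow") for each tokenized table t], plus the metadata dictionary.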
Please provide one or more unit tests in the test folder.
Let's discuss how we can redo this one. Maybe two requirements.txt files: one used as part of the packaging and one used for pulling the dependency on tokenization. Also, have you considered making this module part of the tokenization module? Would it be easier, for inheritance, for this module to be an extension of tokenization rather than its own module?
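As a sketch of the extension idea (the import path and class name of the existing tokenization transform are assumptions, not verified against this repository):

from dpk_tokenization.transform import TokenizationTransform  # assumed import path


class Tokenization2ArrowTransform(TokenizationTransform):
    # Inherit the tokenization behaviour and override only the step that
    # serializes the tokenized tables to .arrow bytes plus metadata.
    ...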
Added the module to pyproject.toml for building the wheel.
Thanks for adding the notebook, @santoshborse!
@santoshborse This file is missing the following:
TRANSFORM_PYTHON_SRC=
TRANSFORM_RAY_SRC=
See https://github.com/IBM/data-prep-kit/blob/dev/transforms/Makefile.transform.template for an example. This is also related to the comment above on transform_python and transform_ray. Let's follow the convention, since there is no really good reason not to.
OK, I will make the changes. I followed https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/Makefile
Hi @touma-I, I have updated the module names and the Makefile as you asked.
@santoshborse Thank you! This looks good. Sorry about the confusion. I am hoping in the next iteration we will simplify things further and get rid of a few constraints. Please stay tuned. I might also reach out to bounce off a few ideas.
@santoshborse The two additional comments I added should address the failed CI/CD. Please reach out if anything is not clear.
Force-pushed from 686fa79 to d201335.
@shahrokhDaijavad Can you review the README.md and notebooks before I merge? Thanks
@santoshborse In the two notebooks, can you please change
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow
to
from dpk_tokenization2arrow.runtime import Tokenization2Arrow
and
from dpk_tokenization2arrow.transform_ray import Tokenization2Arrow
to
from dpk_tokenization2arrow.ray.runtime import Tokenization2Arrow
Also, in the Makefile, don't we need to make the same change for the two targets:
dpk_$(TRANSFORM_NAME).transform_python => dpk_$(TRANSFORM_NAME).runtime
and
dpk_$(TRANSFORM_NAME).transform_ray => dpk_$(TRANSFORM_NAME).ray.runtime
After these changes, if you don't mind, I will make a few small changes in the README file myself.
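With that convention in place, a notebook cell would presumably reduce to something like this (the folder paths are placeholders, and any tokenizer-specific parameters still need to be supplied):

from dpk_tokenization2arrow.runtime import Tokenization2Arrow

# Placeholder folders; the real notebooks supply their own paths and any
# tokenizer parameters the transform requires.
Tokenization2Arrow(input_folder="test-data/input", output_folder="output").transform()

The Ray notebook would swap the import for dpk_tokenization2arrow.ray.runtime.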
@shahrokhDaijavad Updated the notebook and Makefile, and also tested locally to make sure the notebook and both Makefile commands work.
Thanks, @santoshborse. I made a few changes to the README file.
Why are these changes needed?
Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files
Related issue number (if any).
#1009